Embedded Unsupervised Feature Selection
Authors
Abstract
Sparse learning has been proven to be a powerful technique in supervised feature selection, as it allows feature selection to be embedded into the classification (or regression) problem. In recent years, increasing attention has been paid to applying sparse learning in unsupervised feature selection. Due to the lack of label information, the vast majority of these algorithms generate cluster labels via clustering algorithms and then formulate unsupervised feature selection as sparse learning based supervised feature selection with these generated cluster labels. In this paper, we propose a novel unsupervised feature selection algorithm, EUFS, which directly embeds feature selection into a clustering algorithm via sparse learning, without this transformation. The Alternating Direction Method of Multipliers is used to solve the optimization problem of EUFS. Experimental results on various benchmark datasets demonstrate the effectiveness of the proposed framework EUFS.

Introduction

In many real-world applications such as data mining and machine learning, one is often faced with high-dimensional data (Jain and Zongker 1997; Guyon and Elisseeff 2003). Data with high dimensionality not only significantly increases the time and memory requirements of algorithms, but also degrades the performance of many algorithms due to the curse of dimensionality and the existence of irrelevant, redundant and noisy dimensions (Liu and Motoda 2007). Feature selection, which reduces the dimensionality by selecting a subset of the most relevant features, has been proven to be an effective and efficient way to handle high-dimensional data (John et al. 1994; Liu and Motoda 2007).

In terms of label availability, feature selection methods can be broadly classified into supervised and unsupervised methods. The availability of class labels allows supervised feature selection algorithms (Duda et al. 2001; Nie et al. 2008; Zhao et al. 2010; Tang et al. 2014) to effectively select discriminative features that distinguish samples from different classes. Sparse learning has been proven to be a powerful technique in supervised feature selection (Nie et al. 2010; Gu and Han 2011; Tang and Liu 2012a), as it enables feature selection to be embedded in the classification (or regression) problem. Since most data is unlabeled and labeling data is very expensive, unsupervised feature selection has attracted more and more attention in recent years (Wolf and Shashua 2005; He et al. 2005; Boutsidis et al. 2009; Yang et al. 2011; Qian and Zhai 2013; Alelyani et al. 2013). Without label information to define feature relevance, a number of alternative criteria have been proposed for unsupervised feature selection. One commonly used criterion is to select features that can preserve the data similarity or manifold structure constructed from the whole feature space (He et al. 2005; Zhao and Liu 2007). In recent years, applying sparse learning in unsupervised feature selection has attracted increasing attention. These methods usually generate cluster labels via clustering algorithms and then transform unsupervised feature selection into sparse learning based supervised feature selection with these generated cluster labels; examples include Multi-Cluster Feature Selection (MCFS) (Cai et al. 2010), Nonnegative Discriminative Feature Selection (NDFS) (Li et al. 2012), and Robust Unsupervised Feature Selection (RUFS) (Qian and Zhai 2013).
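To make the two-step pipeline concrete, the following is a minimal sketch (illustrative only; it is not the paper's EUFS and not a faithful reimplementation of MCFS, NDFS, or RUFS, and all function names and solver details are assumptions): pseudo cluster labels from plain k-means stand in for class labels, and an l2,1-regularized regression, solved here by proximal gradient descent, scores each feature by the row norm of the learned weight matrix.

```python
# Sketch of the conventional two-step sparse-learning pipeline:
# (1) cluster to get pseudo-labels, (2) l2,1-regularized regression on them.
import numpy as np

def two_step_sparse_fs(X, k, alpha=0.1, n_iter=300, seed=0):
    """Score features via min_W ||Y - XW||_F^2 + alpha * ||W||_{2,1},
    where Y holds one-hot pseudo cluster labels from plain k-means."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Step 1: pseudo cluster labels from a few rounds of Lloyd's k-means.
    centers = X[rng.choice(n, k, replace=False)]
    for _ in range(20):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    Y = np.eye(k)[labels]  # N x k one-hot pseudo-label matrix

    # Step 2: proximal gradient on the l2,1-regularized least squares.
    lr = 1.0 / (2 * np.linalg.norm(X, ord=2) ** 2)  # 1 / Lipschitz constant
    W = np.zeros((d, k))
    for _ in range(n_iter):
        W -= lr * 2 * X.T @ (X @ W - Y)  # gradient step on the smooth part
        # Proximal step: row-wise soft thresholding. Shrinking whole rows
        # of W to zero is what makes the l2,1-norm select features.
        row_norms = np.linalg.norm(W, axis=1, keepdims=True)
        W *= np.maximum(0.0, 1.0 - lr * alpha / np.maximum(row_norms, 1e-12))

    return np.linalg.norm(W, axis=1)  # one relevance score per feature

X = np.random.default_rng(1).normal(size=(100, 20))
scores = two_step_sparse_fs(X, k=3)
print(np.argsort(-scores)[:5])  # indices of the five highest-scoring features
```

The weakness this paper targets is visible in the sketch: the quality of the selected features is capped by the quality of the pseudo-labels produced in Step 1, which are fixed before any feature selection happens.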
In this paper, we propose a novel unsupervised feature selection algorithm, i.e., Embedded Unsupervised Feature Selection (EUFS). Unlike existing unsupervised feature selection methods such as MCFS, NDFS and RUFS, which transform unsupervised feature selection into sparse learning based supervised feature selection with cluster labels generated by clustering algorithms, we directly embed feature selection into a clustering algorithm via sparse learning, without this transformation (see Figure 1). This work theoretically extends the current state of the art in unsupervised feature selection, algorithmically expands the capability of unsupervised feature selection, and empirically demonstrates the efficacy of the new algorithm. The major contributions of this paper are summarized next.

• Providing a way to directly embed unsupervised feature selection into a clustering algorithm via sparse learning, instead of transforming it into sparse learning based supervised feature selection with cluster labels;

• Proposing an embedded feature selection framework, EUFS, which selects features in unsupervised scenarios with sparse learning; and

• Conducting experiments on various datasets to demonstrate the effectiveness of the proposed framework EUFS.

[Figure 1: Differences between the existing sparse learning based unsupervised feature selection methods and the proposed embedded unsupervised feature selection. (a) Existing sparse learning based unsupervised feature selection method. (b) Embedded unsupervised feature selection.]

The rest of this paper is organized as follows. In Section 2, we give details of the embedded unsupervised feature selection framework EUFS. In Section 3, we introduce a method to solve the optimization problem of the proposed framework. In Section 4, we present an empirical evaluation with discussion. In Section 5, we conclude and discuss future work.

Embedded Unsupervised Feature Selection

Throughout this paper, matrices are written as boldface capital letters and vectors as boldface lowercase letters. For an arbitrary matrix M ∈ R^{m×n}, M_{ij} denotes the (i, j)-th entry of M, while m^i and m_j denote the i-th row and the j-th column of M, respectively. ||M||_F is the Frobenius norm of M, and Tr(M) is the trace of M if M is square. ⟨A, B⟩ equals Tr(A^T B), the standard inner product between two matrices. I is the identity matrix and 1 is a vector whose elements are all 1. The l2,1-norm is defined as

$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{m} \|\mathbf{m}^i\|_2 = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} M_{ij}^2}.$$

Let X ∈ R^{N×d} be the data matrix, with each row x_i ∈ R^{1×d} being a data instance. We use F = {f_1, . . . , f_d} to denote the d features, and f_1, . . . , f_d are the corresponding feature vectors. Assume that each feature has been normalized, i.e., ||f_j||_2 = 1 for j = 1, . . . , d. Suppose that we want to cluster X into k clusters (C_1, C_2, . . . , C_k) under the matrix factorization framework:

$$\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_F \quad \text{s.t.} \quad \mathbf{U} \in \{0,1\}^{N \times k}, \ \mathbf{U}\mathbf{1} = \mathbf{1} \tag{1}$$

where U ∈ R^{N×k} is the cluster indicator matrix and V ∈ R^{d×k} is the latent feature matrix. The problem in Eq.(1) is difficult to solve due to the constraint on U. Following the common relaxation for the label indicator matrix (Von Luxburg 2007; Tang and Liu 2012b), the constraint on U is relaxed to orthogonality and nonnegativity, i.e., U^T U = I, U ≥ 0. After the relaxation, Eq.(1) can be rewritten as:

$$\min_{\mathbf{U},\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}^T\|_F \quad \text{s.t.} \quad \mathbf{U}^T\mathbf{U} = \mathbf{I}, \ \mathbf{U} \geq \mathbf{0} \tag{2}$$
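A quick numerical illustration may help fix the notation and the relaxation. The sketch below (the toy data and variable names are my assumptions, not the paper's) evaluates the l2,1-norm as defined above, checks that a hard cluster indicator, once each column is scaled by 1/sqrt(|C_j|), satisfies both relaxed constraints of Eq.(2), and uses the closed-form minimizer of V under the orthogonality constraint.

```python
# Numerical check of the notation and the relaxation in Eqs.(1)-(2).
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 8, 5, 3
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=0, keepdims=True)  # normalize: ||f_j||_2 = 1

def l21_norm(M):
    """||M||_{2,1}: the sum of the l2-norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

labels = np.arange(N) % k                 # every cluster nonempty
U_hard = np.eye(k)[labels]                # binary indicator with U 1 = 1
U = U_hard / np.sqrt(U_hard.sum(axis=0))  # scale columns by 1/sqrt(|C_j|)

assert np.allclose(U.T @ U, np.eye(k))    # orthogonality holds after scaling
assert np.all(U >= 0)                     # nonnegativity holds

# With U^T U = I, the V minimizing ||X - U V^T||_F has the closed form
# V = X^T U (set the gradient of the squared objective to zero).
V = X.T @ U
print(np.linalg.norm(X - U @ V.T), l21_norm(V))
```

The check shows that the relaxed feasible set of Eq.(2) contains (a rescaling of) every hard clustering from Eq.(1), which is what justifies the relaxation.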
Similar resources
Reconstruction-based Unsupervised Feature Selection: An Embedded Approach
Feature selection has been proven to be effective and efficient in preparing high-dimensional data for data mining and machine learning problems. Since real-world data is usually unlabeled, unsupervised feature selection has received increasing attention in recent years. Without label information, unsupervised feature selection needs alternative criteria to define feature relevance. Recently, d...
A Gaussian Mixture Model to Detect Clusters Embedded in Feature Subspace
The goal of unsupervised learning, i.e., clustering, is to determine the intrinsic structure of unlabeled data. Feature selection for clustering improves the performance of grouping by removing irrelevant features. Typical feature selection algorithms select a common feature subset for all the clusters. Consequently, clusters embedded in different feature subspaces are not able to be identified...
Feature selection for semi-supervised data analysis in decisional information systems
Feature selection is an important task in data mining and machine learning processes. This task is well known in both supervised and unsupervised contexts. Semi-supervised feature selection, however, is still under development and far from mature. In general, machine learning has been well developed to deal with partially-labeled data. Thus, feature selection has obtained special impor...
Unsupervised feature evaluation: a neuro-fuzzy approach
The present article demonstrates a way of formulating neuro-fuzzy approaches for both feature selection and extraction under unsupervised learning. A fuzzy feature evaluation index for a set of features is defined in terms of degree of similarity between two patterns in both the original and transformed feature spaces. A concept of flexible membership function incorporating weighted distance is...
Feature Selection for Unsupervised and Supervised Inference: the Emergence of Sparsity in a Weighted-based Approach
The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science including — examples in computer vision, text processing and more recently bioinformatics are abundant. In this work we present a definition of ”relevancy” based on spectral properties of the Affinity (or Laplacian) of the features’ measurement m...
Publication year: 2015